Applying A Normalized Compression Metric To The Measurement Of Dialect Distance
The paper discusses the application of a similarity metric based
on compression to the measurement of the distance among Bulgarian
dialects. The similarity metric is defined on the basis of the notion of
Kolmogorov complexity of a file (or binary string). The application of
Kolmogorov complexity in practice is not possible because its calculation
over a file is an undecidable problem. Thus, the actual similarity metric is
based on a real-life compressor which only approximates the Kolmogorov
complexity. To use the metric for distance measurement of Bulgarian dialects
we first represent the dialectological data in such a way that the metric is
applicable. We propose two such representations, which are compared to a
baseline distance between dialects. Then we conclude the paper with an
outline of our future work.
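As a rough illustration of the idea in the abstract above, the following minimal sketch computes the standard normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. The abstract does not say which compressor or data representation the authors use, so the choice of zlib, the function names, and the toy word lists are all assumptions made for illustration only.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (a stand-in for Kolmogorov complexity)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical toy data: each "dialect site" is a concatenation of word forms.
# These strings are invented and are not taken from the paper.
site_a = "mljako hljab voda djado ".encode("utf-8") * 20
site_b = "mleko hleb voda dedo ".encode("utf-8") * 20

print(ncd(site_a, site_a))  # comparing a site with itself gives a small value
print(ncd(site_a, site_b))  # comparing two different sites
```

A real compressor only approximates the ideal metric, so the distance of a string to itself is small but usually not exactly zero, and the values depend on the compressor chosen.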
Using the linguistic knowledge in BulTreeBank for the selection of the correct parses
Proceedings of the Ninth International Workshop
on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 163-174.
© 2010 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/15891
The Role of Language Technologies in Digital Humanities (The Case of Parliamentary Debates)
The paper focuses on parliamentary debates as a use case in Digital Humanities. First, the ParlaMint project is outlined as a flagship initiative of the CLARIN ERIC infrastructure. The project makes content from national and regional parliaments visible, comparable and accessible for policy making and research. Then, the approaches applied in the creation of 31 corpora from national and regional parliaments are considered. Finally, the utility of the multilingual resource is discussed.
The data-driven Bulgarian WordNet: BTBWN
The paper presents our work towards the simultaneous creation of a data-driven WordNet for Bulgarian and a manually annotated treebank with semantic information. Such an approach requires synchronization of the word senses in both the syntactic and the lexical resource, without limiting the WordNet senses to the corpus or vice versa. Our strategy focuses on the identification of senses used in BulTreeBank, but the missing senses of a lemma have also been covered through exploration of bigger corpora. The identified senses have been organized in synsets for the Bulgarian WordNet and then aligned to the Princeton WordNet synsets. Various types of mappings between the two resources are considered from a cross-lingual perspective, with the aim of ensuring maximum connectivity and the potential for incorporating language-specific concepts. The mapping between the two WordNets (English and Bulgarian) is a basis for applications such as machine translation and multilingual information retrieval.
bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark
We present bgGLUE (Bulgarian General Language Understanding Evaluation), a
benchmark for evaluating language models on Natural Language Understanding
(NLU) tasks in Bulgarian. Our benchmark includes NLU tasks targeting a variety
of NLP problems (e.g., natural language inference, fact-checking, named entity
recognition, sentiment analysis, question answering, etc.) and machine learning
tasks (sequence labeling, document-level classification, and regression). We
run the first systematic evaluation of pre-trained language models for
Bulgarian, comparing and contrasting results across the nine tasks in the
benchmark. The evaluation results show strong performance on sequence labeling
tasks, but there is a lot of room for improvement for tasks that require more
complex reasoning. We make bgGLUE publicly available together with the
fine-tuning and the evaluation code, as well as a public leaderboard at
https://bgglue.github.io/, and we hope that it will enable further advancements
in developing NLU models for Bulgarian.
Comment: Accepted to ACL 2023 (Main Conference)
- …